Abstract. When augmented with the longest common prefix (LCP) array and
some other structures, the suffix array can solve many string processing
problems in optimal time and space. A compressed representation of the LCP
array is also one of the main building blocks in many compressed suffix tree
proposals. In this paper, we describe a new compressed LCP representation:
the sampled LCP array. We show that when used with a compressed suffix array
(CSA), the sampled LCP array often offers better time/space trade-offs than
the existing alternatives. We also show how to construct the compressed
representations of the LCP array directly from a CSA.

1 Introduction 

The suffix tree is one of the most important data structures in string processing 
and bioinformatics. While it solves many problems efficiently, its usefulness is 
limited by its size: typically 10-20 times the size of the text [18]. Much work has
been put into reducing the size, resulting in data structures such as the enhanced
suffix array [1] and several variants of the compressed suffix tree [23,22,11,19].

Most of the proposed solutions are based on three structures: 1) the suffix ar- 
ray, listing the suffixes of the text in lexicographic order; 2) the longest common 
prefix (LCP) array, listing the lengths of the longest common prefixes of lexi- 
cographically adjacent suffixes; and 3) a representation of suffix tree topology. 
While there exists an extensive literature on compressed suffix arrays (CSA)^1
[20], less has been done on compressing the other structures.

Existing proposals to compress the LCP information are based on the per- 
muted LCP (PLCP) array that arranges the entries in text order. While the 
PLCP array can be compressed, one requires expensive CSA operations to ac- 
cess LCP values through it. In this paper, we describe the sampled LCP array as 
an alternative to the PLCP-based approaches. Similar to the suffix array samples 
used in CSAs, the sampled LCP array often offers better time/space trade-offs 
than the PLCP-based alternatives. 

We also modify a recent PLCP construction algorithm [15] to work directly
with a compressed suffix array. Using it, we can construct any PLCP
representation with negligible working space in addition to the CSA and the
PLCP. A variant of the algorithm can also be used to construct the sampled LCP
array, but requires more working space. While our algorithm is much slower than
the alternatives, it is the first LCP construction algorithm that does not
require access to the text and the suffix array. This is especially important
for large texts, as the suffix array may not be available or the text might not
fit into memory.

^1 [...] self-index based on the Burrows-Wheeler transform.

We begin with basic definitions and background information in Sect. 2.
Section 3 is a summary of previous compressed LCP representations. In Sect. 4
we show how to build the PLCP array directly from a CSA. We describe our
sampled LCP array in Sect. 5. Section 6 contains experimental evaluation of our
proposals. In Sect. 7 we compare the sampled LCP array to direct compression
of the LCP values. We finish with conclusions and discussion on future work in
Sect. 8.

2 Background 

A string S = S[1,n] is a sequence of characters from alphabet Σ = {1, 2, ..., σ}.
A substring of S is written as S[i,j]. A substring of type S[1,j] is called a prefix,
while a substring of type S[i,n] is called a suffix. A text string T = T[1,n] is a
string terminated by T[n] = $ ∉ Σ with lexicographic value 0. The lexicographic
order "<" among strings is defined in the usual way.

The suffix array (SA) of text T[1,n] is an array of pointers SA[1,n] to the
suffixes of T in lexicographic order. As an abstract data type, a suffix array is
any data structure with functionality similar to that of the concrete suffix
array. This can be defined by efficient support for the following operations:
(a) count the number of occurrences of a pattern in the text; (b) locate these
occurrences (or more generally, retrieve a suffix array value); and (c) display
any substring of T.

Compressed suffix arrays (CSA) [13,8] support these operations. Their
compression is based on the Burrows-Wheeler transform (BWT) [3], a permutation
of the text related to the SA. The BWT of text T is a sequence L[1,n] such that
L[i] = T[SA[i] - 1], if SA[i] > 1, and L[i] = T[n] = $ otherwise.
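As a small illustration of these definitions, the following sketch builds the suffix array and the BWT of a toy text by naive sorting. It is only meant to make the definitions concrete: it uses 0-based indices instead of the 1-based indices of the text, assumes that the end marker `$` sorts before every other character, and the function names are chosen here for illustration.

```python
def suffix_array(text):
    """Naive suffix array (0-based): sort the suffix start positions."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sa):
    """L[i] = T[SA[i] - 1]; for SA[i] = 0 the index -1 wraps to the end marker."""
    return "".join(text[i - 1] for i in sa)


text = "banana$"
sa = suffix_array(text)
L = bwt(text, sa)
print(sa, L)
```

Naive sorting takes far more time than a real construction algorithm would, so the sketch is suitable only for checking small examples by hand.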

The Burrows-Wheeler transform is reversible. The reverse transform is based
on a permutation called LF-mapping [3,8]. Let C[1,σ] be an array such that
C[c] is the number of characters in {$, 1, 2, ..., c - 1} occurring in the text. For
convenience, we also define C[0] = 0 and C[σ + 1] = n. By using this array and
the sequence L, we define LF-mapping as LF(i) = C[L[i]] + rank_{L[i]}(L, i), where
rank_c(L, i) is the number of occurrences of character c in prefix L[1,i].

The inverse of LF-mapping is Ψ(i) = select_c(L, i - C[c]), where c is the
highest value with C[c] < i, and select_c(L, j) is the position of the jth occurrence
of character c in L [13]. By its definition, function Ψ is strictly increasing in the
range Ψ_c = [C[c] + 1, C[c + 1]] for every c ∈ Σ. Additionally, T[SA[i]] = c and
L[Ψ(i)] = c for every i ∈ Ψ_c.

These functions form the backbone of CSAs. As SA[LF(i)] = SA[i] - 1 [5]
and hence SA[Ψ(i)] = SA[i] + 1, we can use these functions to move the suffix
array position backward and forward in the sequence. Both of the functions can
be efficiently implemented by adding some extra information to a compressed
representation of the BWT. Standard techniques [20] to support suffix array
operations include backward searching [8] for count, and adding a sample of
suffix array values for locate and display.
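The relationship between LF, Ψ and the suffix array can be checked on a toy example. The sketch below is an assumption-laden illustration, not a compressed implementation: it stores everything in plain Python lists, uses 0-based indices (so the rank in LF is reduced by one), and treats text positions cyclically so that stepping past the end marker wraps to the first position.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def lf_mapping(L):
    """LF(i) = C[L[i]] + rank_{L[i]}(L, i), shifted to 0-based indices."""
    counts = {}
    for ch in L:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):          # C[c] = number of smaller characters
        C[ch] = total
        total += counts[ch]
    seen, lf = {}, []
    for ch in L:
        lf.append(C[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    return lf


def invert(permutation):
    """Psi is the inverse permutation of LF."""
    inverse = [0] * len(permutation)
    for i, j in enumerate(permutation):
        inverse[j] = i
    return inverse


text = "banana$"
n = len(text)
sa = suffix_array(text)
L = "".join(text[i - 1] for i in sa)
lf = lf_mapping(L)
psi = invert(lf)
```

With these lists, `sa[lf[i]]` equals `sa[i] - 1` and `sa[psi[i]]` equals `sa[i] + 1`, both taken modulo n because of the cyclic convention assumed above.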

Let lcp(A, B) be the length of the longest common prefix of sequences A and
B. The longest common prefix (LCP) array of text T[1,n] is the array LCP[1,n]
such that LCP[1] = 0 and LCP[i] = lcp(T[SA[i - 1],n], T[SA[i],n]) for i > 1. The
array requires n log n bits of space, and can be constructed in O(n) time [16,15].

3 Previous Compressed LCP Representations 

We can exploit the redundancy in LCP values by reordering them in text order.
This results in the permuted LCP (PLCP) array, where PLCP[SA[i]] = LCP[i].
The following lemma describes a key property of the PLCP array.

Lemma 1 ([16,15]). For every i ∈ {2, ..., n}, PLCP[i] ≥ PLCP[i - 1] - 1.

As the values PLCP[i] + 2i form a strictly increasing sequence, we can store
the array in a bit vector of length 2n [23]. Various schemes exist to represent
this bit vector in a succinct or compressed form [23,11,19].
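The bit vector encoding can be sketched as follows. This is a plain, uncompressed illustration under stated assumptions: the PLCP array is computed naively, `select` is replaced by a linear scan, and the helper names are hypothetical. Positions in `encode` and `decode` are 1-based to match the formula PLCP[i] + 2i.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def plcp_array(text):
    """PLCP in text order (0-based list): PLCP[SA[r]] = LCP[r]."""
    sa = suffix_array(text)
    plcp = [0] * len(text)
    for r in range(1, len(sa)):
        plcp[sa[r]] = common_prefix(text[sa[r - 1]:], text[sa[r]:])
    return plcp


def encode(plcp):
    """Set bit PLCP[i] + 2i (1-based i) in a bit vector of length 2n."""
    bits = [0] * (2 * len(plcp))
    for i, value in enumerate(plcp, start=1):
        bits[value + 2 * i - 1] = 1
    return bits


def decode(bits, i):
    """PLCP[i] = select_1(bits, i) - 2i, with select done by a linear scan."""
    ones = 0
    for pos, bit in enumerate(bits, start=1):
        ones += bit
        if bit and ones == i:
            return pos - 2 * i
    raise IndexError(i)


plcp = plcp_array("banana$")
bits = encode(plcp)
```

A succinct implementation would answer `select` in constant time with o(n) bits of extra space; the linear scan here only demonstrates that the encoding is lossless.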

Space-efficiency can also be achieved by sampling every qth PLCP value, and
deriving the missing values when needed [17]. Assume we have sampled PLCP[aq]
and PLCP[(a + 1)q], and we want to determine PLCP[aq + b] for some b < q.
Lemma 1 states that PLCP[aq] - b ≤ PLCP[aq + b] ≤ PLCP[(a + 1)q] + q - b,
so at most q + PLCP[(a + 1)q] - PLCP[aq] character comparisons are required
to determine the missing value. The average number of comparisons over all
entries is O(q) [17]. By carefully selecting the sampled positions, we can store
the samples in o(n) bits, while requiring only O(log^δ n) comparisons in the worst
case for any 0 < δ < 1 [10].
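The way a missing value is derived from a sample can be sketched as below. The sketch assumes direct access to the text and the plain suffix array, uses 0-based positions with samples at multiples of q, and only exploits the lower bound from Lemma 1 (the comparison loop stops at the first mismatch, so the upper bound is never needed explicitly). All names are illustrative.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def extend(text, i, j, k):
    """Extend a known common prefix of length k between suffixes i and j."""
    n = len(text)
    while max(i, j) + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def naive_plcp(text, sa):
    plcp = [0] * len(text)
    for r in range(1, len(sa)):
        plcp[sa[r]] = extend(text, sa[r], sa[r - 1], 0)
    return plcp


def sampled_access(text, sa, isa, samples, q, i):
    """Recover PLCP[i] from the sample at position q * (i // q)."""
    rank = isa[i]
    if rank == 0:
        return 0                        # no left match
    j = sa[rank - 1]                    # left match of suffix i
    a, b = divmod(i, q)
    lower = max(0, samples[a] - b)      # Lemma 1: PLCP[aq + b] >= PLCP[aq] - b
    return extend(text, i, j, lower)


text = "mississippi$"
sa = suffix_array(text)
isa = [0] * len(text)
for r, p in enumerate(sa):
    isa[p] = r
full = naive_plcp(text, sa)
q = 3
samples = full[::q]                     # keep every qth value only
recovered = [sampled_access(text, sa, isa, samples, q, i)
             for i in range(len(text))]
```

Starting the comparison at the lower bound is what keeps the number of character comparisons small when the sampled value is large.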

Unfortunately these compressed representations are not very suitable for use 
with CSAs. The reason is that the LCP values are accessed through suffix array
values, and locate is an expensive operation in CSAs. In addition to that, sampled 
PLCP arrays require access to the text, using the similarly expensive display. 

Assume that a CSA has SA sample rate d, and that it computes Ψ(i) in time
t_Ψ. To retrieve SA[i], we compute i, Ψ(i), Ψ^2(i), ..., until we find a sampled suffix
array value. If the sampled value was SA[Ψ^k(i)] = j, then SA[i] = j - k. We find a
sample in at most d steps, so the time complexity for locate is O(d · t_Ψ). Similarly,
to retrieve a substring T[i, i + l], we use the samples to get SA^{-1}[d · ⌊i/d⌋]. Then
we iterate the function Ψ until we reach text position i + l. This takes at most
d + l iterations, making the time complexity for display O((d + l) · t_Ψ). From
these bounds, we get the PLCP access times shown in Table 1.^2
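The locate procedure can be sketched in a few lines. The sketch below is not a compressed index: Ψ is a plain list derived from the suffix array, samples are kept in a dictionary, and text positions are 0-based and cyclic, so the result is taken modulo n to cover the case where the walk wraps from the end marker to the first position. The sampling rule (text positions divisible by d) is an assumption made for the example.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_index(text, d):
    """Return (psi, samples): samples keep SA values at text positions p % d == 0."""
    n = len(text)
    sa = suffix_array(text)
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r
    psi = [isa[(p + 1) % n] for p in sa]       # SA[psi[r]] = SA[r] + 1 (cyclic)
    samples = {r: p for r, p in enumerate(sa) if p % d == 0}
    return psi, samples


def locate(r, psi, samples, n):
    """Follow psi until a sampled rank is found; SA[r] = j - k."""
    k = 0
    while r not in samples:
        r = psi[r]
        k += 1
    return (samples[r] - k) % n, k


text = "mississippi$"
n = len(text)
d = 4
psi, samples = build_index(text, d)
located = [locate(r, psi, samples, n) for r in range(n)]
```

Each call returns the suffix array value together with the number of Ψ steps taken, which makes the bound of at most d steps easy to observe.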

Depending on the type of index used, t_Ψ varies from O(1) to O(log n) in the
worst case [20], and is close to 1 microsecond for the fastest indexes in practice
[7,19]. This is significant enough that it makes sense to keep t_Ψ in Table 1.



^2 Some CSAs use LF-mapping instead of Ψ, but similar results apply to them as well.



Table 1. Time/space trade-offs for (P)LCP representations. R is the number
of equal letter runs in BWT, q is the PLCP sample rate, and 0 < δ < 1 is a
parameter. The numbers for CSA assume Ψ access time t_Ψ and SA sample rate
d.

                                                       Access times
Representation     Space (bits)                      Using SA      Using CSA
LCP                n log n                           O(1)          O(1)
PLCP [23]          2n + o(n)                         O(1)          O(d · t_Ψ)
PLCP [11]          2R log(n/R) + O(R) + o(n)         O(1)          O(d · t_Ψ)
PLCP [19]          2R log(n/R) + O(R log log(n/R))   O(log log n)  O(d · t_Ψ + log log n)
Sampled PLCP [10]  o(n)                              O(log^δ n)    O((d + log^δ n) · t_Ψ)
Sampled PLCP [17]  (n/q) log n                       O(q)          O((d + q) · t_Ψ)



The only (P)LCP representation so far that is especially designed for use
with CSAs is Fischer's Wee LCP [10], which is basically the select structure from
Sadakane's bit vector representation [23]. When the bit vector itself would be
required to answer a query, some characters of two lexicographically adjacent
suffixes are compared to determine the LCP value. This increases the time
complexity, while reducing the size significantly. In this paper, we take the other
direction by reducing the access time, while achieving similar compression as in
the run-length encoded PLCP variants [11,19].

4 Building the PLCP Array from a CSA 

In this section, we adapt the irreducible LCP algorithm [15] to compute the
PLCP array directly from a CSA.

Definition 1. For i > 1, the left match of suffix T[SA[i],n] is T[SA[i - 1],n].

Definition 2. Let T[j,n] be the left match of T[i,n]. PLCP[i] is reducible, if
i, j > 1 and T[i - 1] = T[j - 1]. If PLCP[i] is not reducible, then it is irreducible.

The following lemma shows why reducible LCP values are called reducible.

Lemma 2 ([15]). If PLCP[i] is reducible, then PLCP[i] = PLCP[i - 1] - 1.

The irreducible LCP algorithm works as follows: 1) find the irreducible PLCP
values; 2) compute them naively; and 3) fill in the reducible values by using
Lemma 2. As the sum of the irreducible values is at most 2n log n, the algorithm
works in O(n log n) time [15].

The original algorithm uses the text and its suffix array that are expensive
to access in a CSA. In the following lemma, we show how to find the irreducible
values by using the function Ψ instead.

Lemma 3. Let T[j,n] be the left match of T[i,n]. The value PLCP[i + 1] is
reducible if and only if T[i] = T[j] and Ψ(SA^{-1}[j]) = Ψ(SA^{-1}[i]) - 1.



-- Compute the PLCP array

PLCP[1] ← lcp(SA^{-1}[1])
(i, x) ← (1, SA^{-1}[1])
while i < n
    Ψ_c ← rangeContaining(x)
    if x - 1 ∉ Ψ_c or Ψ(x - 1) ≠ Ψ(x) - 1
        PLCP[i + 1] ← lcp(Ψ(x))
    else PLCP[i + 1] ← PLCP[i] - 1
    (i, x) ← (i + 1, Ψ(x))

-- Compute an LCP value

def lcp(b)
    (a, k) ← (b - 1, 0)
    Ψ_c ← rangeContaining(b)
    while a ∈ Ψ_c
        (a, b, k) ← (Ψ(a), Ψ(b), k + 1)
        Ψ_c ← rangeContaining(b)
    return k

Fig. 1. The irreducible LCP algorithm for using a CSA to compute the PLCP
array. Function rangeContaining(x) returns Ψ_c = [C[c] + 1, C[c + 1]] where x ∈ Ψ_c.
PLCP[1] is always irreducible by Definition 2, so it is computed with lcp.



Proof. Let x = SA^{-1}[i]. Then x - 1 = SA^{-1}[j].

"If." Assume that T[i] = T[j] and Ψ(x - 1) = Ψ(x) - 1. Then the left match
of T[SA[Ψ(x)],n] = T[i + 1,n] is T[SA[Ψ(x - 1)],n] = T[j + 1,n]. As i + 1 > 1
and j + 1 > 1, it follows that PLCP[i + 1] is reducible.

"Only if." Assume that PLCP[i + 1] is reducible, and let T[k,n] be the left
match of T[i + 1,n]. Then k > 1 and T[k - 1] = T[i]. As T[k - 1,n] and T[i,n]
begin with the same character, and T[k,n] is the left match of T[i + 1,n], there
cannot be any suffix S such that T[k - 1,n] < S < T[i,n]. But now j = k - 1,
and hence T[i] = T[j]. Additionally,

Ψ(SA^{-1}[j]) = Ψ(SA^{-1}[k - 1]) = SA^{-1}[k] = SA^{-1}[i + 1] - 1 = Ψ(SA^{-1}[i]) - 1.

The lemma follows. □

The algorithm is given in Fig. 1. We maintain invariant x = SA^{-1}[i], and
scan through the CSA in text order. If the conditions of Lemma 3 do not hold
for T[i,n], then PLCP[i + 1] is irreducible, and we have to compute it. Otherwise
we reduce PLCP[i + 1] to PLCP[i]. To compute an irreducible value, we iterate
(Ψ^k(b - 1), Ψ^k(b)), for k = 0, 1, 2, ..., until the suffixes at suffix array
positions Ψ^k(b - 1) and Ψ^k(b) begin with different characters. When this
happens, we return k as the requested LCP value. As we compute Ψ(·) for a
total of O(n log n) times, we get the following theorem.

Theorem 1. Given a compressed suffix array for a text of length n, the
irreducible LCP algorithm computes the PLCP array in O(n log n · t_Ψ) time, where
t_Ψ is the time required for accessing Ψ. The algorithm requires O(log n) bits of
working space in addition to the CSA and the PLCP array.
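The algorithm of Fig. 1 can be tried out with the following sketch. It is a 0-based adaptation under simplifying assumptions: Ψ is a plain list and the ranges Ψ_c are represented by a list `first` that gives the first character of the suffix at each rank (information a CSA provides through the array C). The test "a ∈ Ψ_c" becomes a comparison of first characters. The naive function is included only to verify the result.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def csa_view(text):
    """Stand-in for a CSA: psi and the first character of each sorted suffix."""
    n = len(text)
    sa = suffix_array(text)
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r
    psi = [isa[(p + 1) % n] for p in sa]
    first = sorted(text)
    return psi, first


def lcp(b, psi, first):
    """Compare the suffix at rank b with its left match, one psi step per character."""
    a, k = b - 1, 0
    while a >= 0 and first[a] == first[b]:
        a, b, k = psi[a], psi[b], k + 1
    return k


def irreducible_plcp(psi, first):
    n = len(psi)
    plcp = [0] * n
    x = psi[0]                          # rank of the whole text T[0..n-1]
    plcp[0] = lcp(x, psi, first)        # the first value is always irreducible
    for i in range(n - 1):
        same_range = x > 0 and first[x - 1] == first[x]
        if not same_range or psi[x - 1] != psi[x] - 1:
            plcp[i + 1] = lcp(psi[x], psi, first)   # irreducible: compute naively
        else:
            plcp[i + 1] = plcp[i] - 1               # reducible: Lemma 2
        x = psi[x]
    return plcp


def naive_plcp(text):
    sa = suffix_array(text)
    plcp = [0] * len(text)
    for r in range(1, len(sa)):
        i, j, k = sa[r], sa[r - 1], 0
        while text[i + k] == text[j + k]:
            k += 1
        plcp[i] = k
    return plcp


result = irreducible_plcp(*csa_view("banana$"))
```

Neither `irreducible_plcp` nor `lcp` reads the text or the suffix array; both rely only on Ψ and on the character ranges, which is the point of Lemma 3.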

We can use the algorithm to build any PLCP representation from Table 1
directly. The time bound is asymptotically tight, as shown in the following lemma.

Lemma 4 (Direct extension of Lemma 5 in [15]). For an order-k de Bruijn
sequence on an alphabet of size σ, the sum of all irreducible PLCP values is
n(1 - 1/σ) log_σ n - O(n).



The sum of irreducible PLCP values of a random sequence should also be
close to n(1 - 1/σ) log_σ n. The probability that the characters preceding a suffix
and its left match differ, making the PLCP value irreducible, is (1 - 1/σ). On
the other hand, the average irreducible value should be close to log_σ n. For a
text generated by an order-k Markov source with H bits of entropy, the estimate
becomes n(1 - 1/σ')(log n)/H. Here σ' is the effective alphabet size, defined by
the probability 1/σ' that two characters sharing an order-k context are identical.
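The entropy-based estimate is simple enough to evaluate directly. The sketch below assumes that logarithms are base 2 (consistent with entropy measured in bits); the function name and parameter names are hypothetical.

```python
import math


def irreducible_sum_estimate(n, sigma_eff, entropy_bits):
    """Estimate n * (1 - 1/sigma') * (log n) / H for the sum of irreducible values."""
    return n * (1 - 1 / sigma_eff) * math.log2(n) / entropy_bits


# Uniform random text over 4 characters: H = 2 bits and sigma' = 4, so the
# estimate coincides with n * (1 - 1/4) * log_4 n.
n = 4 ** 10
estimate = irreducible_sum_estimate(n, 4, 2)
```

For the uniform case, (log n)/H is exactly log_σ n, so the Markov-source estimate reduces to the expression given for random sequences.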

The following proposition shows that large-scale repetitiveness reduces the 
sum of the irreducible values, and hence improves the algorithm performance. 

Proposition 1. For a concatenation of r copies of text T[1,n], the sum of
irreducible PLCP values is s + (r - 1)n, where s is the sum of the irreducible PLCP
values of T.

Proof. Let T' = T_1 T_2 ··· T_r be the concatenation, T_{a,i} the suffix starting at
T_a[i], and PLCP_a[i] the corresponding PLCP value. Assume that T_r[n] is
lexicographically greater than the other end markers, but otherwise identical to them.

For every i, the suffix array of T' contains a range with values T_{1,i}, T_{2,i}, ..., T_{r,i}
[19]. Hence for any a > 1 and any i, the left match of T_{a,i} is T_{a-1,i}, making the
PLCP values reducible for almost all of the suffixes of T_2 to T_r. The exception is
that T_{2,1} is irreducible, as its left match is T_{1,1}, and hence PLCP_2[1] = (r - 1)n.

Let T[j,n] be the left match of T[i,n] in the suffix array of T. Then the left
match of T_{1,i} is T_{r,j}, and PLCP_1[i] = PLCP[i]. Hence the sum of the irreducible
values corresponding to the suffixes of T_1 is s. □

5 Sampled LCP Array 

By Lemmas 1 and 2, the local maxima in the PLCP array are among the
irreducible values, and the local minima are immediately before them.

Definition 3. The value PLCP[i] is maximal, if it is irreducible, and minimal,
if either i = n or PLCP[i + 1] is maximal.

Lemma 5. If PLCP[i] is non-minimal, then PLCP[i] = PLCP[i + 1] + 1.

Proof. If PLCP[i] is non-minimal, then PLCP[i + 1] is reducible. The result follows
from Lemma 2. □

In the following, R is the number of equal letter runs in BWT. 

Lemma 6. The number of minimal PLCP values is R. 

Proof. Lemma 3 essentially states that PLCP[i + 1] is reducible, if and only if
L[Ψ(SA^{-1}[i])] = T[i] = T[j] = L[Ψ(SA^{-1}[j])] = L[Ψ(SA^{-1}[i]) - 1], where T[j,n]
is the left match of T[i,n]. As this is true for n - R positions i, there are exactly
R irreducible values. As every maximal PLCP value can be reduced to the next
minimal value, and vice versa, the lemma follows. □
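Lemma 6 is easy to check empirically on small texts. The sketch below counts the equal letter runs in the BWT and the irreducible and minimal PLCP values, using Definitions 2 and 3 directly on the text. It assumes 0-based positions and treats the lexicographically smallest suffix, which has no left match, as irreducible; the function names are illustrative.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_runs(text, sa):
    """Number of equal letter runs in the BWT."""
    L = [text[p - 1] for p in sa]       # index -1 wraps to the end marker
    return 1 + sum(1 for a, b in zip(L, L[1:]) if a != b)


def irreducible_positions(text, sa):
    """Text positions whose PLCP value is irreducible (Definition 2)."""
    positions = set()
    for r, i in enumerate(sa):
        j = sa[r - 1] if r > 0 else None
        reducible = (r > 0 and i > 0 and j > 0
                     and text[i - 1] == text[j - 1])
        if not reducible:
            positions.add(i)
    return positions


def minimal_positions(text, sa):
    """Position i is minimal if it is the last one or PLCP[i + 1] is maximal."""
    maximal = irreducible_positions(text, sa)
    n = len(text)
    return {i for i in range(n) if i == n - 1 or i + 1 in maximal}


text = "banana$"
sa = suffix_array(text)
R = bwt_runs(text, sa)
maximal = irreducible_positions(text, sa)
minimal = minimal_positions(text, sa)
```

For the toy text, the BWT `annb$aa` has five runs, and both the irreducible and the minimal values number five as well.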



Lemma 7. The sum of minimal PLCP values is S - (n - R), where S is the
sum of maximal values.

Proof. From Lemmas 5 and 6. □

If we store the minimal PLCP values in SA order, and mark their positions
in a bit vector, we can use them in a similar way as the SA samples. If we
need LCP[i], and LCP[Ψ^k(i)] is a sampled position for the smallest k ≥ 0, then
LCP[i] = LCP[Ψ^k(i)] + k. As k can be O(n) in the worst case, the time bound is
O(n · t_Ψ).
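Access through minimal samples can be sketched as follows. The sketch keeps only the minimal values (no extra samples), stores them in a dictionary keyed by suffix array rank instead of a bit vector with compressed samples, and builds everything naively from the text. Indices are 0-based and the names are hypothetical.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_sampled_lcp(text):
    """Return (lcp, psi, samples), where samples maps rank -> minimal LCP value."""
    n = len(text)
    sa = suffix_array(text)
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r
    psi = [isa[(p + 1) % n] for p in sa]
    lcp = [0] * n
    for r in range(1, n):
        i, j, k = sa[r], sa[r - 1], 0
        while text[i + k] == text[j + k]:
            k += 1
        lcp[r] = k

    def irreducible(r):
        # Preceding characters differ (index -1 wraps to the end marker).
        return r == 0 or text[sa[r] - 1] != text[sa[r - 1] - 1]

    samples = {}
    for r, p in enumerate(sa):
        if p == n - 1 or irreducible(isa[p + 1]):   # PLCP[p] is minimal
            samples[r] = lcp[r]
    return lcp, psi, samples


def access(r, psi, samples):
    """LCP[r] = LCP[psi^k(r)] + k for the first sampled rank psi^k(r)."""
    k = 0
    while r not in samples:
        r = psi[r]
        k += 1
    return samples[r] + k


lcp, psi, samples = build_sampled_lcp("banana$")
answers = [access(r, psi, samples) for r in range(len(lcp))]
```

The walk always terminates, because the position of the end marker is minimal by definition and every Ψ chain reaches it.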

To improve the performance, we sample one out of d' = n/R^{1-ε} consecutive
non-minimal values for some ε > 0. Then there are R minimal samples and
at most R^{1-ε} extra samples. We mark the sampled positions in a bit vector of
Raman et al. [21], taking at most (1 + o(1)) · R log(n/R) + O(R) + o(n) bits of space.
Checking whether an LCP entry has been sampled takes O(1) time.

We use δ codes [5] to encode the actual samples. As the sum of the minimal
values is at most 2n log n, these samples take at most

R log((2n log n)/R) + O(R log log(n/R)) ≤ R log(n/R) + O(R log log n)

bits of space. The extra samples require at most log n + O(log log n) bits each.
To provide fast access to the samples, we can use dense sampling [S] or directly 
addressable codes [2]. This increases the size by a factor of 1 + o(1), making the
total for samples (1 + o(1)) · R log(n/R) + O(R log log n) + o(R log n) bits of space.

We find a sampled position in at most n/R^{1-ε} steps. By combining the size
bounds, we get the following theorem.

Theorem 2. Given a text of length n and a parameter 0 < ε < 1, the sampled
LCP array requires at most (2 + o(1)) · R log(n/R) + O(R log log n) + o(R log n) + o(n)
bits of space, where R is the number of equal letter runs in the BWT of the text.
When used with a compressed suffix array, retrieving an LCP value takes at most
O((n/R^{1-ε}) · t_Ψ) time, where t_Ψ is the time required for accessing Ψ.

By using the BSD representation [14] for the bit vector, we can remove the
o(n) term from the size bound with a slight loss of performance.

When the space is limited, we can afford to sample the LCP array denser 
than the SA, as SA samples are larger than LCP samples. In addition to the 
mark in the bit vector, an SA sample requires 2 log ^ bits of space, while an LCP 
sample takes just log v + O(log log v) bits, where v is the sampled value.

The LCP array can be sampled by a two-pass version of the irreducible
LCP algorithm. On the first pass, we scan the CSA in suffix array order to
find the minimal samples. Position x is minimal, if x is the smallest value in
the corresponding Ψ_c, or if Ψ(x - 1) ≠ Ψ(x) - 1. As we compress the samples
immediately, we only need O(log n) bits of working space. On the second pass,
we scan the CSA in text order, and store the extra samples in an array. Then
we sort the array to SA order, and merge it with the minimal samples. As the
number of extra samples is o(R), we need o(R log n) bits of working space.
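The first pass can be illustrated with a short sketch that finds the minimal positions from Ψ alone. As before, this is a 0-based toy version under stated assumptions: Ψ is a plain list, the ranges Ψ_c are represented by a list `first` of first characters in suffix array order, and the helper names are invented for the example.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])


def csa_view(text):
    """Stand-in for a CSA: psi and the first character of each sorted suffix."""
    n = len(text)
    sa = suffix_array(text)
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r
    psi = [isa[(p + 1) % n] for p in sa]
    return sa, psi, sorted(text)


def minimal_ranks(psi, first):
    """Rank x is minimal if it starts its range or psi(x - 1) != psi(x) - 1."""
    ranks = []
    for x in range(len(psi)):
        starts_range = x == 0 or first[x - 1] != first[x]
        if starts_range or psi[x - 1] != psi[x] - 1:
            ranks.append(x)
    return ranks


def bwt_runs(text, sa):
    L = [text[p - 1] for p in sa]
    return 1 + sum(1 for a, b in zip(L, L[1:]) if a != b)


text = "banana$"
sa, psi, first = csa_view(text)
ranks = minimal_ranks(psi, first)
positions = sorted(sa[x] for x in ranks)
```

The scan touches ranks in increasing order and needs only Ψ(x - 1) and Ψ(x) at each step, which matches the small working space claimed for the first pass.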



Theorem 3. Given a compressed suffix array for a text of length n, the modified
irreducible LCP algorithm computes the sampled LCP array in O(n log n · t_Ψ)
time, where t_Ψ is the time required for accessing Ψ. The algorithm requires
o(R log n) bits of working space in addition to the CSA and the samples, where
R is the number of equal letter runs in the BWT of the text.



6 Implementation and Experiments 

We have implemented the sampled LCP array, a run-length encoded PLCP array,
and their construction algorithms as a part of the RLCSA [24].^3 For PLCP, we
used the same run-length encoded bit vector as in the RLCSA. For the sampled
LCP, we used a gap encoded bit vector to mark the sampled positions, and a
stripped-down version of the same vector for storing the samples.

To avoid redundant work, we compute minimal instead of maximal PLCP 
values, and interleave their computation with the main loop. To save space, we 
only use strictly minimal PLCP values with PLCP[i] < PLCP[i + 1] + 1 as the
minimal samples. When sampling the LCP array, we make both of the passes in 
text order, and store all the samples in an array before compressing them. 

For testing, we used a 2.66 GHz Intel Core 2 Duo E6750 system with 4 GB
of memory (3.2 GB visible to OS) running a Fedora-based Linux with kernel
2.6.27. The implementation was written in C++, and compiled with g++ version
4.1.2. We used four data sets: human DNA sequences (dna) and English language
texts (english) from the Pizza & Chili Corpus [7], the Finnish language Wikipedia
with version history (fiwiki) [24], and the genomes of 36 strains of Saccharomyces
paradoxus (yeast) [19].^4 When the data set was much larger than 400 megabytes,
a 400 MB prefix was used instead. Further information on the data sets can be
found in Table 2.

Only on the dna data set, the sum of the minimal values was close to the 
entropy-based estimate. On the highly repetitive fiwiki and yeast data sets, the 
difference between the estimate and the measurement was very large, as
predicted by Proposition 1. Even regular English language texts contained enough
large-scale repetitiveness that the sum of the minimal values could not be ad- 
equately explained by the entropy of the texts. This suggests that, for many 
real-world texts, the number of runs in BWT is a better compressibility measure 
than the empirical entropy. 

The sum of minimal PLCP values was a good estimate for PLCP construction 
time. LCP sampling was somewhat slower because of the second pass. Both 
algorithms performed reasonably well on the highly repetitive data sets, but 
were much slower on the regular ones. The overall performance was roughly an 
order of magnitude worse than for the algorithms using plain text and SA [15].



^3 The implementation is available at http://www.cs.helsinki.fi/group/suds/rlcsa/

^4 The yeast genomes were obtained from the Durbin Research Group at the Sanger
Institute (http://www.sanger.ac.uk/Teams/Team71/durbin/sgrp/).

Fig. 2. Time/space trade-offs for retrieving an LCP or SA value. The times
are averages over 10^ random queries. Sampled LCP results are grouped by SA 
sample rate. 



We measured the performance of the sampled LCP array and the run-length 
encoded PLCP array on each of the data sets. We also measured the locate 
performance of the RLCSA to get a lower bound for the time and space of any 
PLCP-based approach. The results can be seen in Fig. 2.

The sampled LCP array outperformed PLCP on english and dna, where most 
of the queries were resolved through minimal samples. On fiwiki and yeast, the 
situation was reversed. As many extra samples were required to get reasonable 
performance, increasing the size significantly, the sampled LCP array had worse 
time/space trade-offs than the PLCP array. 

While we used RLCSA in the experiments, the results generalize to other 
types of CSA as well. The reason for this is that, in both PLCP and sampled 
LCP, the time required for retrieving an LCP value depends mostly on the 
number of iterations of Ψ required to find a sampled position.



7 Comparison with Direct LCP Compression 



In a recent proposal [3], the entire LCP array was compressed by using directly
addressable codes (DAC-LCP). The resulting structure was much faster than
the other compressed LCP representations, requiring less than a microsecond to 
access an LCP value. On the other hand, DAC-LCP was also much larger: 6 to 
8 bits per character. 

To compare the sampled LCP array to DAC-LCP, we downloaded the same 
data sets as DAC-LCP was tested on. This included 100 MB prefixes of XML 
data (xml), human DNA and protein sequences (dna and proteins) and source 
code (sources) from Pizza & Chili Corpus. We then sampled the LCP array with
sample rate 16 on each of the data sets. DAC-LCP sizes were visually estimated
from the reported results [3]. The results are in Table 3.

In addition to the implemented version of the sampled LCP array (Sam- 
pled), we also estimated the size of another variant (Sampled 2). Instead of a 
gap-encoded bit vector to mark the sampled positions, we use the rank/select 
implementation of González [12] on a plain bit vector. This takes a total of 1.05n
bits for a text of length n. To store the samples, we use directly addressable 
codes, estimating the average size of an LCP value to be the same as in
DAC-LCP. These results can also be found in Table 3.

On each of the data sets, the sampled LCP array was clearly smaller than 
DAC-LCP. The difference was smaller on the mostly random dna and proteins 
data sets than on the more structured sources and xml data sets. The reason 
is that on random data, LCP values are small and the number of samples is 
large, decreasing the size of DAC-LCP and increasing the size of the sampled 
LCP array, respectively. This also suggests that DAC-LCP becomes very large 
on highly repetitive data sets, where most of the LCP values are large. 



8 Discussion 



We have described the sampled LCP array, and shown that it offers better 
time/space trade-offs than the PLCP-based alternatives, when the number of 
extra samples required for dense sampling is small. Based on the experiments, 
it seems that one should use the sampled LCP array for regular texts, and a 
PLCP-based representation for highly repetitive texts. DAC-LCP is also a good 
choice for regular texts, if performance is more important than compression. 

We have also shown that it is feasible to construct the (P)LCP array directly 
from a CSA. While the earlier algorithms are much faster, it is now possible to 
construct the (P)LCP array for larger texts than before, and the performance is 
still comparable to that of direct CSA construction [21]. On a multi-core system, 
it is also easy to get extra speed by parallelizing the construction. 

It is possible to maintain the (P)LCP array when merging two CSAs. The 
important observation is that an LCP value can only change, if the left match 
changes in the merge. An open question is, how much faster the merging is, both 
in the worst case and in practice, than rebuilding the (P)LCP array. 

While the suffix array and the LCP array can be compressed to a space 
relative to the number of equal letter runs in BWT, no such representation is
known for suffix tree topology. This is the main remaining obstacle in the way 
to compressed suffix trees optimized for highly repetitive texts. 

Table 2. Properties of the data sets. H5 is the order-5 empirical entropy, σ'
the corresponding effective alphabet size, # the number of (strictly) minimal
values, and S the sum of those values. S' = n(1 - 1/σ')(log n)/H5 - n/σ' is
an entropy-based estimate for the sum of the minimal values. The construction
times are in seconds.



               Estimates                Minimal values            Strictly minimal
Name     MB    H5    σ'    S'/10^6   #/10^6   S/10^6   S/n     #/10^6   S/10^6   S/n
english  400   1.86  2.09  3167      156.35   1736     4.14    99.26    1052     2.51
fiwiki   400   1.09  1.52  3490      1.79     273      0.65    1.17     117      0.28
dna      385   1.90  3.55  4252      243.49   3469     8.59    158.55   2215     5.48
yeast    409   1.87  3.34  4493      15.64    520      1.21    10.05    299      0.70

         Sample rates                      PLCP            Sampled LCP
Name     SA                 LCP            Time   MB/s     Time   MB/s
english  8, 16, 32, 64      8, 16          1688   0.24     2104   0.19
fiwiki   64, 128, 256, 512  32, 64, 128    327    1.22     533    0.75
dna      8, 16, 32, 64      8, 16          3475   0.11     3947   0.10
yeast    32, 64, 128, 256   16, 32, 64     576    0.71     890    0.46


Table 3. The sizes of DAC-LCP and two versions of the sampled LCP array in 
bits per character on 100 MB data sets. # is the total number of samples. 



Name      #/n    Sampled   Sampled 2   DAC-LCP
dna       0.41   5.48      3.44        5.8
proteins  0.44   4.39      4.16        7.0
sources   0.17   2.33      2.33        7.5
xml       0.13   1.85      2.09        7.8



